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Abstract 

Background: In recent years many genetic variants (eSNPs) liave been reported as associated with expression of 
transcripts in trans. However, tlie causal variants and regulatory mechanisms through which they act remain mostly 
unknown. In this paper we follow two kinds of usual suspects: SNPs that alter coding regions or transcription 
factors, identifiable by sequencing data with transcriptional profiles in the same cohort. We show these 
interpretable genomic regions are enriched for eSNP association signals, thereby naturally defining source-target 
gene pairs. We map these pairs onto a protein-protein interaction (PPI) network and study their topological 
properties. 

Results: For exonic eSNP sources, we report source-target proximity and high target degree within the PPI 
network. These pairs are more likely to be co-expressed and the eSNPs tend to have a c/s effect, modulating the 
expression of the source gene. In contrast, transcription factor source-target pairs are not observed to have such 
properties, but instead a transcription factor source tends to assemble into units of defined functional roles along 
with its gene targets, and to share with them the same functional cluster of the PPI network. 

Conclusions: Our results suggest two modes of trans regulation: transcription factor variation frequently acts via a 
modular regulation mechanism, with multiple targets that share a function with the transcription factor source. 
Notwithstanding, exon variation often acts by a local c/s effect, delineating shorter paths of interacting proteins 
across functional clusters of the PPI network 
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Background 

Creating the complete human regulatory map is an active 
field of study. Many previous studies have used genomic 
analyses of gene expression, binding motifs, epigenetic 
marks and other local features to infer regulatory interac- 
tions [1-5]. In recent years it has been established that 
genetic variation can contribute an additional angle to this 
investigation [6-9]. Formally, transcription level is consid- 
ered as a quantitative trait that is altered by allelic varia- 
tion with thousands of single nucleotide polymorphism 
(SNPs) reported as associated with changes in gene 
expression [10-13]. Such markers, called expression SNPs 
(eSNPs) are further found to contribute to variation of dis- 
ease phenotypes and other clinically relevant traits [14-16]. 
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Variation in genomic DNA can affect transcription in 
multiple ways. Most intuitively perhaps, level of tran- 
scripts in cis of an eSNP may be altered due to allelic 
variation in regulatory elements [17]. Alternatively, such 
levels may be auto-regulated by changes in protein struc- 
ture that reflect variation of the sequence content of local 
transcripts. Therefore, cis eSNPs have been studied 
extensively. However, cis associations are limited in their 
ability to inform us regarding the network of regulatory 
interactions between one gene and another. This moti- 
vates more focused study of the effects of genetic variants 
on expression of distal transcripts {trans associations). 
Unfortunately, while trans eSNPs can identify down- 
stream effects and previously un-annotated regulatory 
pathways, they are harder to statistically and biologically 
justify than cis eSNPs. From a statistical perspective, 
since trans eSNPs can be associated with any distal tran- 
script, the multiple testing burden dramatically increases, 



o 



Bion/led Central 



© 201 3 Kreimer and Pe'er; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative 
Commons Attribution License [http://creativecommons.0rg/licenses/by/2.O), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 



Kreimer and Pe'er Genome Biology 2013, 14:R71 
http://genomebiology.eom/2013/14/7/R71 



Page 2 of 13 



thus only a small number of results is detected. From a 
biological perspective, more complex mechanisms are 
needed to explain trans associations. An example of such 
a mechanism is an eSNP with local cis effect on a gene 
which codes for a transcription factor known to regulate 
other genes in trans. Indeed, across multiple eSNP stu- 
dies [7,18], even when statistically significant trans or cis 
eSNPs associations are detected aplenty, the regulatory 
mechanisms by which they alter gene expression remain 
mostly unknown. 

A large fraction of SNPs identified by genome-wide 
association studies (GWAS) [10] have been reported to 
be associated with disease phenotypes [14] despite being 
neither coding, nor linked to coding SNPs in cis. 
Furthermore, since large-scale genetic studies have been 
predominantly based on SNP arrays, SNP alleles that are 
reported as associated, in studies of either disease [10] 
or gene expression [7], are often merely tags for causal 
variants, whose identity is challenging to track down. 
More generally, the multitude of phenotypes for eSNPs 
represents an opportunity for tackling the central ques- 
tion of causation in association. 

Protein-protein interaction (PPI) networks capture var- 
ious experimental data, such as from yeast two-hybrid 
systems [19], regarding the physical binding of proteins, 
and are often used to examine how these interactions are 
involved in a specific biological function. Recently, 
improved data on signal transduction and metabolic and 
molecular networks have contributed to the fidelity and 
accuracy of the reconstructed PPI networks. However, 
the data represented by these networks can sometimes be 
partial and noisy. PPI networks have been modeled as 
theoretical graphs and their topological properties exten- 
sively studied [20-22]. This provided insights pertaining 
to functional, structural and evolutionary characteriza- 
tion of these networks, primarily in model organisms. 
Genetic interactions in yeast were studied in the context 
of protein complexes network [23], motivating the inves- 
tigation of genetic variants that alter gene expression (as 
interactions) with respect to the human PPI network [24]. 
Studies of PPI networks in the context of genetic varia- 
tion have thus far focused on GWAS-detected SNPs that 
are associated with common traits and disease, reporting 
that genes that harbor such SNPs frequently code for 
interacting proteins [24-28]. Yet, such studies only con- 
sidered the PPI-network nodes that correspond to the 
associated SNP, without a PPI network node that would 
correspond to the phenotype. 

Here, we perform a comprehensive study of trans 
genetic associations and their large-scale properties as 
manifested on a PPI network. We use SNPs from sequen- 
cing data [29] that are candidates to be causal based on 
their genomic location, and then project their association 
to gene expression on a PPI network. We hypothesized 



that genes involved in true eSNP associations have dis- 
tinct PPI-network properties that differ significantly from 
spurious genes with candidate association signals. To 
address this hypothesis, we focus on trans association of 
eSNPs in exons and transcription factors (TFs), analyzing 
their properties as reflected on the PPI-network topology 
and annotations of the genes involved. Our focus on 
expression quantitative traits allows consideration of 
paths along the PPI network, whose links with genetic 
variation had previously only been studied with respect 
to SNPs, rather than the transcripts they modulate. 

Our results suggest that a significant fraction of eSNPs 
in exons act in trans through mild effects in cis, with a 
regulation mechanism that is mediated by PPI paths 
that are shorter than expected by chance and tend to 
traverse across functional clusters of the PPI network. 
These paths highlight zinc ion binding genes as a possi- 
ble mechanism of transcript-eSNP feedback across the 
PPI network. In comparison to such coding eSNPs, we 
observe that TFs harboring eSNPs and their associated 
genes create units of genes that are functionally 
enriched for biological annotations. This suggests a dif- 
ferent, modular regulatory mechanism for such TF 
eSNPs. Altogether, our analysis offers insights concern- 
ing a variety of mechanisms by which genetic variation 
at functional loci shapes the structure of human regula- 
tory networks. 

Results 

Computational framework for mapping trans associations 
onto the PPI network 

We were interested in pinpointing directly associated var- 
iants rather than indirectly imputed ones. We thus used a 
publicly available dataset of 50 fully sequenced Yoruban 
samples [29] along with their transcription profiles from 
RNA-sequencing data [30], bearing in mind that such 
available cohorts are limited in size. Due to this small sam- 
ple size, we have limited power in detecting association. 
Therefore, most candidate eSNPs can only be designated 
with various levels of uncertainty. 

We were intrigued to examine fra«5-eSNPs interac- 
tions with respect to an independent space of interac- 
tions, that is, a PPI network. Therefore, we evaluated 
two categories of candidate eSNPs that reside within 
regions along the genome with known regulatory poten- 
tial and can be mapped onto a PPI network, that is, 
exons and TFs (Figure SI and Table SI in Additional 
file 1; see Methods). Examining the distribution of 
P-values across these two categories of candidate trans- 
eSNPs, we observed that candidate eSNPs within exons 
show evidence of including true positive eSNPs (Figure S2a 
in Additional file 1), as been previously shown [31]. 
By contrast, eSNP candidates in TFs show association 
signal distributions consistent with random expectation 
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(Figure S2b in Additional file 1). We further examine if TF 
candidate eSNPs exhibit qualities that are different from 
random. We hypothesized that a single TF will be asso- 
ciated with multiple transcripts via eSNPs. To address this 
hypothesis, we created 1,000 permuted sets of pairs of TF 
and transcript (see Methods). We observed that the num- 
ber of multiple associated transcripts is significantly higher 
(Wilcoxon rank sum test P <0.05) in the real dataset (973 
out of 1,000 permuted sets, empirical P-value = 0.027). Fol- 
lowing these two observations, we focused on eSNPs 
within exons as the first subject of our investigation, and 
compared them to eSNPs within the span of transcription 
factor genes. We set out to characterize and compare these 
two modes of trans regulation. 

For each candidate eSNP that is associated with levels 
of a transcript in trans, we denoted this transcript as the 
'target' of the eSNP. When this eSNP was located within 
an exon or in the span of a TF, we defined this gene as 
'source'. We attempted to characterize eSNPs interac- 
tions on the molecular level by mapping these pairs of 
source-target genes onto a PPI network (see Methods 
and Figure 1) and studied their functional annotations 
and topological properties. 

Identifying topological properties of exonic eSNP 
interactions 

We first considered pairs of exon eSNP source and target 
that demonstrated an association signal which was signif- 
icant exome-wide for a particular transcript (association 



P <10' ). We observed such pairs to be significantly clo- 
ser {P = 0.03) on the PPI network when compared with 
randomly permuted candidate eSNPs (see Methods). 
Beyond pairwise properties of sources and targets, we 
further attempted to characterize each by their single- 
node features. Specifically, the targets of exon eSNPs had 
significantly higher (P = 0.003) degree than expected 
based on random pairs. 

We reasoned that the cutoff of association P-value we 
used {P <10'^) was in many ways arbitrary, as we were 
interested in the statistical properties of the set of 
results rather than the significance of a particular result 
amid the testing burden. We therefore considered multi- 
ple P-value thresholds of eSNP association, and at each 
threshold evaluated topological properties of eSNP 
source and target pairs, while assessing significance vis- 
a-vis randomly permuted sets of candidate eSNPs in 
exons (see Methods). We observed that the lower the 
association P-values for source-target pairs, the more 
their topological properties differed compared with ran- 
dom pairs (Table S2 in Additional file 2). For example, 
for source-target pairs of exon eSNP, the average target 
degree among the 52 pairs exceeding an association 
P-value cutoff of 10'^'^ was 16.42, but it reached as 
much as 22.22 among the more focused set of 22 pairs 
that exceeded association P-value cutoff 10' ' . These 
averages were each significant {P = 0.02 and 0.006, 
respectively) when compared with permuted pairs of 
exon eSNPs, whose target degree was only 9.36 on 




Figure 1 trans associations (solid and dashed red straight arrows) on a protein-protein interaction network. An eSNP (red tick mark) that 
resides within a known exon (left) or TF (middle) maps to the PPI network (right). The source gene (blue 5) is associated in trans with the levels 
of a target transcript (green t). PPI network edges are denoted in black, and define the shortest path between the exon source and its target 
(solid red curved arrow). The association between an eSNP within a TF source and its gene target is denoted by a dashed red curved arrow. 
eSNP, expression single nucleotide polymorphism; PPI, protein-protein interaction; TF transcription factor. 
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average. These trends are consistent with properties of 
true positives being diluted by false positives at less sig- 
nificant P-value thresholds. We quantified such trends by 
regressing each topological property on the negative 
loglO of the association P-value (Figure 2). We confirmed 
that for exonic source-target pairs, network distance 
decreased and the target degree increased with the signif- 
icance of association (Spearman rank correlation coeffi- 
cients r = -0.98 and 0.97, respectively; permutation 
/'-value P = 0.001 and 0.002, respectively - see Methods). 
These results highlight unique properties of part of the 
transcripts whose trans regulation is due to coding 



variation. Specifically, we show that loci implicated by 
eSNPs encode for proteins that physically interact in a 
non-random fashion. Furthermore, target proteins are 
likely to interact with significantly more nodes of the 
PPI network than expected by chance. 

Characterization of exon and transcription factor sources 
and targets 

Based on these results, for further analysis, we focused on 
the maximal P-value cutoff of 10'^ *^^, for which all topo- 
logical properties showed significant difference between 
true source-target pairs of exon eSNPs and random ones 
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Figure 2 Topological properties on a protein-protein interaction network versus exonic source-target association significance. 

Averages for (a) distance between source and target, (b) source degree and (c) target degree are evaluated across source-target pairs of 
candidate exon eSNPs at varying association p-value thresholds {+). The average of randomly permuted pairs (dashed horizontal line) is shown 
for permuted pairs and Spearman's rank correlation coefficient (denoted r) is listed when significant at P <0.05 (denoted p). 
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(Wilcoxon rank sum test P <0.05), (Figure S3 and 
Table S3 in Additional file Is Table S2 in Additional 
file 2). 

There were 343 pairs of source and target and 295 
unique pairs, 59 of them on the network. Of these pairs, 
318 (92.71%) were on different chromosomes and 
25 (7.29%) were on the same chromosome, at least 
1 Mb apart (See Table S4a in Additional file 1). At this 
cutoff there were 333 unique eSNPs in exons, 286 unique 
gene sources and 267 unique gene targets (Table S5 in 
Additional file 3). When comparing the effect sizes (abso- 
lute values of betas in the linear regression) of 929 pre- 
viously published cis expression quantitative trait loci 
(eQTLs) [30] with the distribution of exonic and TF 
trans eSNPs effect sizes, we found that the trans effect 
sizes (mean 1.198) were significantly higher than those of 
corresponding cis effects (mean 0.964; Wilcoxon rank 
sum test P-value <2.25 x 10' and 3.56 x 10' for exonic 
and TF eSNPs, respectively; Figure S4 in Additional file 
1). We binned eSNPs and SNPs in exons by first, middle 
and last exons (See Figure S5 in Additional file 1). 
We also examined the position of the eSNP along the 
transcript and compared these results to SNPs in exons 
(See Figure S6 in Additional file 1). We observed that 
these trans exonic eSNPs tended to be located along 
middle exons, rather than in first or last exons (Fisher's 
exact test P-value <0.009). We further observed that they 
tended to lie farther away down the transcript (Wilcoxon 
rank sum test P = 0.0058). These results were different 
from what was observed for cis eQTLs. Montgomery 
et al. [32] reported that eQTLs with higher confidence 
were located in the first and last exons significantly more 
than in middle exons. 

The combined set of exon sources was enriched for 
major histocompatibility complex protein genes (false 
discovery rate (FDR) <0.046) with concordance to find- 
ings in previous studies, indicating human leukocyte 
antigen SNPs were 10-fold enriched for ira«s-eSNPs 
[33]. We further observed that the set of target genes 
was enriched for multitude functional processes (see 
Table S6 in Additional file 4 for full list of annotations). 
The three highest scoring functional annotations of the 
target set, macromolecule modification, phosphatidylino- 
sitol-3,5-bisphosphate binding and protein modification 
process, provide additional support for the role of exo- 
nic eSNP targets as network hubs [34]. 

For further investigation and comparison, we considered 
source-target pairs of TF candidate eSNPs, a set with simi- 
lar order of magnitude, corresponding to association sig- 
nals passing the P-value cutoff of lO'*'. There were 370 
such pairs of TF source-target, 193 of them unique, 58 of 
which were on the network. Of these pairs, 359 (97.03%) 
were on different chromosomes and 11 (2.97%) were on 
the same chromosome, at least 1 Mb apart (See Table S4b 



in Additional file 1). There were 358 unique eSNPs in TFs, 
77 unique TF sources and 192 unique targets (Table S5 in 
Additional file 3). Out of the 358 unique eSNPs in TFs, 15 
were in exons, significantly more than expected by chance 
(hypergeometric P-value <1.8xl0'*). When we examined 
the combined set of TF targets, we observed that this gene 
set was enriched for various annotation categories (see 
Table S6 in Additional file 4 for full list of annotations). 

To further establish the association between the 
source and target genes, we examined the co-expression 
between eSNP source and target for all candidate pairs 
of associated genes in this dataset by evaluating Spear- 
man's rank-correlation coefficient r. For pairs of exon- 
source eSNPs and their corresponding targets, the abso- 
lute value of r was significantly higher than expected 
from the entire distribution of co-expression measure- 
ments in this dataset (Wilcoxon rank sum test 
P <5.4xl0' ). By contrast, for pairs of TF-source eSNPs 
and their corresponding targets, there was no significant 
difference in terms of co-expression. We observed the 
fraction of non-synonymous SNPs to be 0.082 out of 
exon eSNPs, which was higher than their overall fraction 
0.071 among all exonic SNPs [35] (Fisher exact P 
approximately 0.1). For each eSNP we examined cis 
effects that were too mild to be detected at genome- 
wide significance threshold by testing for its association 
with the expression of its source gene. In total, 50 pairs 
of exonic eSNP and source gene were nominally (P < 
0.05) cis associated, out of 286 such unique sources (P = 
3.6 X lO''^^). We estimated how many of the SNPs in 
exons have a cw-effect (linear regression P-value <0.05) 
on the expression of their host gene. We found that out 
of 97,135 exonic SNPs, 9,661 showed ds-effect on their 
host gene at the nominal significance level (P <0.05). 
Compared to this background distribution, the observed 
50 out of 286 trans eSNPs having such c/s-effects is sig- 
nificantly more than expected by chance (Fisher's exact 
test P-value <9.6 x 10' ). This provides additional sup- 
port for the ci5-effect phenomena. For comparison, we 
did not observe a nominally significant cis effect 
between TF eSNP and its source gene more than 
expected by chance (3 out of the 66 TF sources in this 
dataset). These results suggest a mechanism where exo- 
nic variation often operates in trans eSNPs via alteration 
of gene expression in cis, and the source and target 
genes have correlated expression. 

TFs are known to control the transcription of multiple 
genes; we were therefore interested in whether we 
observed the same phenomena in TF variation. Each TF 
source forms, along with its targets, a set of genes that we 
called a unit. We observed that these units tended to be 
enriched for functional annotation categories. Specifically, 
for the 33 TF sources with two target genes or more 
(Tables S7 and S8 in Additional file 1), 26 out of 33 define 
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units that are functionally enriched (two or more anno- 
tated genes, FDR <0.05) [36] in KEGG [37] and GO [38] 
categories (Table S9 in Additional file 5). Interestingly, 
eSNP targets did not tend to share exon sources. Specifi- 
cally, out of 286 unique sources, 278 had a single target, 7 
{AI<NA, CDK7, BLK, ATP5G1, RPL8, TRAPPC12, MUC2) 
of the remaining ones had two, and one (HLA-C) had 
three (Table S5 in Additional file 3). The difference 
between the number of associated targets in TF and exon 
variation was statistically significant (Wilcoxon rank sum 
test P < 3.4 X 10' ). These results support the hypothesis 
that TF variation frequently acts via a modular regulation 
mechanism, with multiple targets that share a function 
with the TF source. 

We systematically looked for pairs of TF source-target 
that were experimentally validated as binding. We found 
such enrichment, with 6 out of 34 TF source-target pairs 
compared to 551 out of 6,904 random pairs (Fisher's 
exact test P <0.05, see Methods) in a database reporting 
binding of TFs to DNA, based on chromatin immunopre- 
cipitation (ChlP)-X experiments [39]. We used the data 
in [40] to find the closest DNasel hypersensitive site 
(DHS) window to the gene target, and examined whether 
the TF eSNP was associated with the DHS levels in this 
window. We found that 33 of 370 such pairs of TF eSNP 
and gene target were significantly associated {P < 0.05) 
indicating significant enrichment {P < 5.5 x 10' ) of this 
phenomenon. This enrichment was not an artifact of TF 
eSNP ascertainment: we tested the association of 29,212 
TF SNPs to DHS levels in a randomly picked DHS win- 
dow; as expected by chance, 1,400 of these SNPs showed 
such association at the nominal significance level, P < 
0.05. Compared to this background distribution, the 
observed set of 33 out of 370 tram eSNPs having such 
association was significantly larger than expected by 
chance (Fisher's exact test P-value < 6 x 10'*). This 
shows that even in a small sample size where the number 
of true positives is diluted with false positives, we still 
recover a true signal. 

We were intrigued by potential connections between 
source-target pairs and cluster properties in the PPI net- 
work. Therefore, we partitioned the PPI network into 
clusters of genes, optimizing the modularity measure 
[41]. Out of the resulting 249 PPI clusters with two 
genes or more, 225 (90%) demonstrated functional 
enrichment for a biological category (Table SIO in Addi- 
tional file 6). TF source-target pairs were found in the 
same PPI clusters more than expected by chance: 26 out 
of 58 TF pairs compared with 26,966 out of 100,000 
random pairs (Fisher's exact test P < 0.0043). 

As an illustration for our results, we show an example 
(Figure 3a) of a specific source and its gene target, exam- 
ining transcription factor 7-like 2; T-cell specific, HMG- 
box {TCF7L2) and its transcript target transducin-like 



enhancer of split 4 {TLE4). There was a significant cis 
effect {P < 0.012) of the associated intronic eSNP 
rs7087006 with the expression of TCF7L2, but the co- 
expression correlation of the source and target was not 
statistically significant in this dataset. TCF7L2 and its five 
targets (unit number 28, Table S8 in Additional file 1) 
comprise a unit that was enriched (two out of six) for cell 
proliferation (FDR <0.03; Table S9 in Additional file 5). 
This TF plays a key role in the Wnt signaling pathway, 
activating v-myc avian myelocytomatosis viral oncogene 
homolog {MYC) expression in the presence of catenin 
(cadherin-associated protein), beta 1, 88kDa {CTNNBl). 
The gene target TLE4 within the PPI network is a tran- 
scriptional co-repressor that represses transactivation 
mediated by TCF7L2 and CTNNBl. These annotations 
implicate that TCF7L2, TLE4 and MYC act as the net- 
work motif incoherent type-l-feed-forward loop (a pulse 
generator and response accelerator) [42] where the two 
arms of the feed-forward loop act in opposition: TCF7L2 
activates MYC (in the presence of CTNNBl) but also 
represses MYC by activating the repressor TLE4 (via an 
eSNP). We note that TCF7L2 harbors the common allele 
most strongly associated with increased risk of type 2 dia- 
betes. Correspondingly, TLE4 was recently discovered as 
a T2D locus [43]. Specifically, TLE4 encodes a protein 
that forms complexes with TCF proteins, including 
TCF7L2, to modulate transcription at target sites [44]. 
The source and target are part of the same PPI network 
cluster, which is enriched (1,257 out of 4,627) for regula- 
tion of transcription (FDR <2.4 x 10'®**, Table SIO in 
Additional file 6; Figure 3a). This demonstrates a case of 
shared function between a source TF and its target. 

By contrast, only 19 (32%) of exon eSNP sources were 
found in the same PPI network cluster as their respec- 
tive single targets, consistent with chance expectation 
(Methods). Yet, as such pairs were linked by relatively 
shorter paths (Figure 2a), it follows that coding variants 
affect transcription in trans not in a modular way but 
rather in a linear fashion that defines shorter paths than 
expected by chance. We recorded the proteins along 
such paths (Table Sll in Additional file 1) and evalu- 
ated the enrichment of functional annotation for each 
path (Table S12 in Additional file 7). 

We show an example (Figure 3b) of exon source and its 
gene target, examining the path between gene source p53- 
induced death domain protein (PIDD) and gene target 
polo-like kinase 3 (PLK3); path number 18, Tables Sll in 
Additional file 1 and S12 in Additional file 7). This path 
was enriched for the p53 signaling pathway (FDR <0.01, 
Table S12 in Additional file 7). PIDD promotes apoptosis 
downstream of the tumor suppressor as a component of 
the DNA damage/stress response pathway that connects 
p53 to apoptosis. The gene target PLK3 is a serine/threo- 
nine kinase that plays a role in regulation of cell cycle 
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Figure 3 Examples of transcription factors and exon source-target pairs. An eSNP (red tick mark) along a source gene (blue circle), either in 
an exon or TF (blue rectangle), is associated (solid red line for exon, dashed for TF) with levels of transcription of the target gene (green circle). 
The source and target genes interact via nodes (black circles) and edges (black solid lines) in the PPI network. Each node belongs to a PPI 
cluster (purple cloud) with a functional annotation, (a) Network motif 11-FFL [42]: TCF7L2 activates MYC (in the presence of ClNNBl) but also 
represses MYC by activating the repressor TLE4 (via an eSNP) . (b) The shortest path on the PPI network between PIDD source and its gene 
target PLK3. Binding sites of TP53 were found in the promoter of PLK3. TP53 is annotated as a zinc ion binding protein. There was a significant 
correlation between the expression of the source and target genes. TCF7L2, transcription factor 7-like 2; T-cell specific; TLE4 transducin-like 
enhancer of split 4; MYC, v-myc avian myelocytomatosis viral oncogene; catenin (cadherin-associated protein), beta 1, 88kDa {CTNNBl); PIDD, p53- 
induced death domain protein; PLK3, polo-like kinase 3; EFEMP2, Epidermal growth factor-containing fibulin-like extracellular matrix protein 2; 
TP53, tumor protein p53. 



progression and potentially in tumorgenesis. Epidermal 
growth factor-containing fibulin-like extracellular matrix 
protein 2 {EFEMP2)and tumor protein p53 (TP53) 
reside along the shortest path between PIDD and PLK3 
(Figure 3b). There is evidence from ChlP-ChIP and ChlP- 
seq experiments that TP53 has binding sites in the promo- 
ter of PLK3 [39] and it is annotated as a zinc ion binding 
protein. Furthermore, the combination of a pair of genes 
with TF-DNA and PPI edge between them is a known net- 
work motif (mixed-feedback loop) [45], suggesting a 
mechanism by which the expression of the target gene is 
altered. In support of this, the co-expression correlation of 
the source and target genes was significant (Spearman 
rank-correlation test r = 0.3223, P <0.02). The exon gene 
source and target reside in different PPI network clusters: 
PIDD resides in a cluster that is enriched for regulation of 
cell death (FDR <4.5 x lO'*^, Table SIO in Additional file 6) 
and PLK3 resides in a cluster that is enriched for regula- 
tion of transcription (FDR <2.4 x 10'**^ Table SIO in Addi- 
tional file 6). 

These results beg a mechanistic explanation that would 
clarify how the network interaction at the protein level is 
leading to the observed changes in transcript levels. 



Fortunately, examination of the genes along the reported 
paths provides a plausible answer, as they are strongly 
enriched for zinc ion binding proteins. Specifically, when 
we examined the enrichment for annotations of genes 
along shortest paths in the real dataset, we observed 410 
enriched categories (minimum of 10 genes from a cate- 
gory, FDR <0.05; Table S13 in Additional file 8). For 
comparison, across 1,000 permuted datasets we observed 
a total of 1,870 categories satisfying the same enrichment 
criteria. We focus on the six categories that were 
enriched in real data and not in permutations: ion bind- 
ing, metal ion binding, cation binding and intracellular, 
zinc ion binding and transition metal ion binding 
(Table S13 in Additional file 8). We compared two prop- 
erties in real versus permuted datasets: first, the number 
of genes from each category (empirical /"-values 0.005 
and 0.014 for zinc ion binding and transition metal ion 
binding respectively); and second the number of paths 
where we observed at least one gene from each category 
(empirical P-values 0.016 and 0.038 for zinc ion binding 
and transition metal ion binding respectively). These 
results were replicated in a second permuted dataset. For 
comparison, only 7 and 10 out of the 404 joint categories 
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achieve an empirical P-value lower than 0.05 for these 
two properties respectively. These results indicate that 
the genes in real paths were enriched for zinc ion bind- 
ing, which is associated with regulation of transcription, 
suggesting a possible mechanism by which the expression 
level of the target transcript is modified. 

Discussion 

We present a computational approach to study the 
characteristics of trans regulation. We observed that 
candidate eSNPs within exons exhibited an overabun- 
dance of significant association signals. We consequently 
focused on eSNPs that resided within an exon of a 
source gene, and were associated with the expression 
level of a different gene target. We observed that candi- 
date eSNPs within TPs were associated with a higher 
number of transcripts than expected by chance. We sub- 
sequently examined eSNPs that resided within the span 
of source TPs. We mapped these pairs of source and 
target onto a PPI network and analyzed their topological 
properties. 

We applied our approach to publicly available genetics 
and genomics [30] data from the same samples. We 
demonstrated that, by combining association data with 
information on PPI, it is possible to unravel topological 
properties for the two tram association types. We found 
that for an eSNP exon source and its gene target, the 
stronger the association, the closer the source-target dis- 
tance and the higher the target degree in the PPI network. 
Expression analysis showed these source-target pairs to be 
frequently co-expressed, and that these exon eSNPs often 
had significant cis effects on the expression of the source 
genes. The observed phenomenon of exonic variation leav- 
ing a signature on PPI paths raises speculations regarding 
the mechanisms of transcription regulation. Previous stu- 
dies have indirectly tackled these speculations regarding 
the connection between eSNP regulation and the PPI 
space. Specifically, Rossin et al. found that PPI connec- 
tions between loci defined in GWAS of a specific disease 
were more densely connected than chance expectation 
[24], and Nicolae et al. [14] observed that SNPs found in 
GWAS were more likely to be eSNPs. The comprehen- 
siveness of our work relied on combining eQTL data with 
the PPI network and not merely GWAS data, as described 
in previous studies [27]. This allowed us to examine 
source-target connections across the network, rather than 
be limited to studying the source nodes as in GWAS-PPI 
analyses. The novel observation is that the genetic varia- 
tion that modifies PPI network properties is associated 
with a normal expression landscape and not only with 
extreme cases of disease. 

We attempted to go beyond topological results and 
shed light on the regulatory mechanism by which gene 
expression of the target gene is altered in these shorter 



paths. We systematically compared genes along real and 
permuted shortest paths and found enrichment for ion 
zinc binding proteins, suggesting a plausible mechanism 
by which the expression level of the target transcript is 
modified. More generally, the paths of interacting pro- 
tein pairs, from a source protein to the target protein, 
were consistent with concatenation of two pathways 
(Figure S7 in Additional file 1). The prefix of the path 
was consistent with a regulatory pathway, leading to 
some regulatory protein (TP or other) that affects 
expression of the target. The suffix of the path may 
match a self feedback loop in reverse: from the target 
protein back to the same regulatory protein [46]. 

We demonstrated it is possible to characterize regula- 
tory variation in TPs. We observed that eSNP TP sources 
and their gene targets create units of genes that are 
enriched for functional annotations. When decomposing 
the PPI network to clusters, we observed that these 
source-target pairs tend to reside within the same cluster. 

The design choices for a study of this kind convey a few 
methodological limitations. First, because we were inter- 
ested in detecting putatively causal variants based on 
their exact genomic location, we used a dataset of fully 
sequenced individuals along with their transcription pro- 
files. Such cohort sizes are limited in size, reducing the 
power to detect association and allowing us to see only 
the strongest effects. Second, we were interested in 
understanding the mechanisms underlying eSNPs inter- 
actions. This required the use of a well-established inter- 
action network. We examined our results on a PPI 
network, rather than a TF-DNA interaction network or 
co-expression network derived from this dataset, to 
establish a broad and independent network of interac- 
tions. Overall, both the raw datasets [29,30] and support- 
ing databases [37-39,47-49] in this work were noisy and 
limited. That we observed statistically significantly plausi- 
ble results in such a small dataset combined with noisy 
databases is encouraging. Potentially, an increase in sam- 
ple size may enable detection of eSNP associations at 
more significant P-values for even milder effects. 

Conclusions 

Over the last decade, causal interpretation of genetic 
association signals for common variants and common 
traits had been impeded by two hurdles. First, many of 
the signals had been obtained as indirect association to 
proxy genetic markers, without access to the directly 
and causally associated variant. Second, often the trait 
under investigation was not understood at the molecular 
mechanistic level well enough to decipher the connec- 
tion between variant and phenotype. This work bridges 
the gap between association and causality by considering 
both direct association to sequencing-ascertained var- 
iants, as well as expression quantitative traits. The ability 
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to tie together these loose ends of genetic association 
using an interaction map constitutes a notable stride 
towards understanding the thousands of such connec- 
tions that recent genetics have discovered. 

Our main findings suggest two modes of tram regula- 
tion via genetic variation in exons and TFs. Exonic var- 
iation possibly acts through mild cis effects that alter 
the expression of the source gene and delineates shorter 
paths between functional clusters (Figure 4a), and exo- 
nic eSNP targets might play an important role in the 
PPI network as hubs. TP variation frequently acts via a 
modular regulation mechanism, with multiple targets 
that share a function with the TP source (Pigure 4b). 

Puture studies could extend the approach presented 
here to investigate how genetic variation in different 
meaningful genomic locations (for example, enhancers, 
insulators, miRNAs) correlates with gene targets. Data- 
sets that combine sequenced variants coupled with gene 
expression and phenotypic traits are limited in human, 
but available for other model organisms [50,51]. It 
would be insightful to combine this type of study with 



phenotypic data, to see how trans association tracks 
with phenotypes. Specifically, applying our approach to 
samples under various conditions (for example, disease), 
could improve understanding of condition-specific regu- 
latory processes [24]. Moreover, considering genetics- 
genomics data across different tissues along with a 
tissue-specific PPI network [52] could be telling regard- 
ing the underlying regulatory mechanisms characterizing 
these tissues. 

Methods 

Data details and processing 

We analyzed a cohort of 50 Yoruban samples, for which 
genotypes of SNPs that are fully ascertained from 
sequencing data [29] along with RNA-sequencing data 
[30] are publicly available. Briefly, the raw dataset con- 
sists of 10,553,953 genotyped SNPs and expression mea- 
surements (quantile-quantile normalized values) of 
18,147 genes with Ensembl gene ID across these 50 sam- 
ples. Standard filters have been applied to the genetic 
data: minor allele frequency >0.05, SNP missingness 





Figure 4 Summary illustration - two suggested modes of frans regulation, (a) Exon variation often acts by a local cis effect, delineating 
shorter paths of interacting proteins across functional clusters of the PPI network, (b) TF variation frequently acts via a modular regulation 
mechanism, with multiple targets that share a function with the TF source. (See Figure 3 legend for further details.) eSNP, expression single 
nucleotide polymorphisms. 
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rate <0.1 and individual missingness rate <0.1 [53]. After 
filtering, data for analysis consist of 50 samples with 
7,206,056 SNPs. 

Association testing 

For association analysis, we considered only SNPs that 
resided within candidate regulatory regions along the 
genome. For tram association, we tested for association 
between a SNP and every gene; we considered SNPs 
within the span of known exons and TFs (including 
introns) [54]. We tested for association using linear 
regression (Figure S2 in Additional file 1). 

Obtaining a random distribution of association test 
statistics 

Examining the random distribution of association tests 
was helpful in evaluating the empirical significance of 
results. This was achieved by generating 100,000 ran- 
dom pairs of sources and targets for exonic and TF var- 
iation separately. We used a strict randomization 
process of edges switching. We picked a source gene 
from all sources in the real data; we then picked a target 
gene from all targets in the real data with a P-value cut- 
off of 10' . When evaluating the number of targets per 
TF source, we created 1,000 sets of random TF source 
and gene target pairs; each set contained 370 such pairs 
corresponding to 370 TF source-target pairs at a 
/"-value cutoff of 10' in the real data. 

Identifying topological properties of source-target pairs 
projected on the PPI network 

We used the PPI network provided by the Human Protein 
Reference Database [49]. The undirected network contains 
9,671 nodes and 37,041 edges. For each node, we calcu- 
lated its degree: the number of edges incident on the 
node. We defined a distance between every two nodes as 
the number of edges on the shortest path between them. 
All pair-wise shortest paths were determined using the 
Floyd- Warshall algorithm [55]. In cases where the network 
had more than one connected component, nodes from 
two different components were defined to have a distance 
of twice the maximal distance obtained within the 
components. 

Identifying topological trends across association P-values 

For exons, we observed the emergence of true positive 
associations between P-values 10'^ and 10'^ (Figure S2a 
in Additional file 1). Therefore, we focused on P-values 
<10' and sorted all source-target pairs according to the 
significance of their association signal. We considered 
each prefix of this list, that is, each subset of source-tar- 
get pairs exceeding a particular threshold, for signifi- 
cance of association signal. For each such subset, we 



reported each one of the topological properties defined 
above averaged over the subset. We calculated Spear- 
man's correlation coefficient between significance 
thresholds and each of these cumulative averages. In a 
similar process, we randomly chose an equal number of 
arbitrary source-target pairs on the PPI network. Adding 
these pairs one by one created a distribution of analo- 
gous cumulative averages for permuted pairs. We 
recorded the Spearman correlation coefficient for these 
100,000 permuted distributions. We calculated the 
empirical P-value for the significance of the observed 
correlation coefficients by counting the number of times 
when permuted r > real r and divided this by the num- 
ber of permutations. 

Expression analysis 

We calculated all pairwise co-expression correlations for all 
gene pairs in the dataset using Spearman rank-correlation 
test, and therefore obtained the distribution of the correla- 
tion coefficient r. To determine whether the distribution of 
r between source-target pairs differed from its background 
distribution, we employed the Wilcoxon ranked-sum test. 

Enrichment of eSNPs for cIs effects 

We examined whether eSNPs that were associated with 
a target's expression level also affected expression levels 
of the corresponding source. We tested this by consider- 
ing, for each source-target pair, the one eSNP most 
associated to the expression for the target. We tallied 
the source-target pairs for which this eSNP was also sig- 
nificantly associated (P < 0.05) with the expression level 
of the source. Under the null, the number of such pairs 
is a random variable that is binomially distributed. Bin 
(n = number ofunique source genes, P = 0.05). 

Unit and path annotation 

We defined units of genes by considering a TF source and 
its gene targets. We examined shortest paths within the 
PPI network between eSNP exon source and its gene tar- 
get. The enrichment of units and paths with gene subsets 
from the Gene Ontology [38], and KEGG [37] databases 
was calculated by Genatomy [36]. We reported only units 
or paths with annotations that had a significant FDR of 
0.05 or better. The description of genes in units or paths is 
cited from the National Center for Biotechnology Informa- 
tion Gene database and GeneCards [56]. 

Finding transcription factor source-target pairs in the 
experimental database 

The Chip Enrichment Analysis (ChEA) database [39] 
represents a collection of interactions describing the 
binding of transcription factors to DNA, collected from 
ChlP-X (ChlP-chip, ChlP-sequencing, ChlP-positron 
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emission tomography and DNA adenine methyltransfer- 
ase identification) experiments. For each TF source and 
target, we examined if they were present in ChEA. We 
repeated the same procedure for 100,000 permuted 
pairs of a random TF source and a random gene target. 
We then compared, using Fisher's exact test, the num- 
ber of pairs in ChEA between real and permutation 
pairs, out of all pairs where the TF source was included 
in the database. 

Finding PPI network decomposition to clusters 

The decomposition of the PPI network to clusters was 
computed by using the Louvain algorithm presented in 
[57]. This is a heuristic method that is based on modu- 
larity optimization. The method consists of two phases 
and partitions the network into clusters such that the 
number of edges between clusters is significantly less 
than expected by chance. The method provides a math- 
ematical measure for modularity with network-size nor- 
malized values, ranging from 0 (low modularity) to 1 
(maximum modularity). This method has been pre- 
viously applied to various biological networks [58] and 
specifically to a PPI network [59]. 

Significance of source and target residing in the same PPI 
cluster 

For each exon and TF source-target pair, we recorded 
whether both source and target resided in the same PPI 
cluster. We repeated the same procedure with 100,000 
permuted unique source-target pairs from nodes on the 
PPI network. We then compared the number of cluster 
co-occurrences between real data and permutations 
using the Fisher exact test. 

Comparing shortest paths annotation content 

We recorded all genes along the shortest paths 
between exonic sources and targets, both in real and 
permuted data. We then looked for enrichment in this 
set of genes (at least 10 genes per category, FDR 
<0.05). We created sets of 1,000 permuted 55 shortest 
paths (from the 17,564 shortest paths in permutations) 
that followed the exact length distribution of the 55 
real paths. For each one of the six categories that was 
not enriched in permutations, we performed two ana- 
lyses: first, we counted how many genes from each 
category appeared in the real paths (with repetitions, 
that is if gene x from category Y appeared in two 
shortest paths we counted it twice); and second, we 
counted how many of the 55 paths had at least one 
gene from this category. We repeated the same proce- 
dures for the 1,000 permuted sets. For each category, 
we then counted how many of the 1,000 permutations 
achieved equal or greater numbers than seen for the 
real data (empirical -P- value). 



Additional material 



Additional file 1: Supplementary text and figures. 

Additional file 2: Table S2: Exon source with their corresponding eSNP 
targets, for each P-value smaller than 10"^, where a source-target pair on 
the PPI network was added, we recorded the differences between 
topological properties of random and real pairs using Wilcoxon rank sum 
test The table includes for each P-value the number of unique pairs on 
the PPI network, the rank sum test P-values and the mean value for each 
one of the topological properties (distance and source and target 
degrees) for real and random pairs. 

Additional file 3: Table S5: For all TF and exonic source-target pairs we 
give the eSNP rs number, eSNP chromosome, eSNP location, source 
gene ID, target gene ID, target chromosome and association P-value. For 
eSNPs in TF, we indicate whether they are within an exon. 

Additional file 4: Table S6: Functional enrichment analysis of combined 
sets of exon sources, exon targets and TF targets (gene sets include only 
genes that map to an Entrez ID). 

Additional file 5: Table S9: TF units' functional enrichment (gene sets 
include only genes that map to an Entrez ID). 

Additional file 6: Table S10: Functional enrichment analysis of clusters in 
the PPI network (gene sets include only genes that map to an Entrez ID). 

Additional file 7: Table S12: Functional enrichment of exon paths, 
between source and target (gene sets include only genes that map to 
an Entrez ID). 

Additional file 8: Table S13: Enriched annotations (minimum 10 genes, 
FDR <0.05) of genes along real and permuted data shortest paths, and 
gene names for the six categories that were enriched in real shortest paths. 
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